Credit Card Services: Strategies for retention of customers

Introduction

Credit card fees have been a considerable source of revenue for Thera Bank. The bank recently saw a steep decline in the number of existing customers using their credit cards, which will lead to a loss of revenue. The current aim is to improve the credit card services so that customer retention stabilizes or grows.

Background: Banks charge credit card fees either in specific circumstances or regardless of usage. These include annual fees, balance transfer fees, late fees, foreign transaction fees and so on, and they provide an appreciable source of revenue for banks.

Objective

  1. Identifying which customers are likely to continue using credit cards with the bank.
  2. Developing strategies to improve the credit card services that will help retention of existing customers.
  3. Generating insights and recommendations for the bank for credit card products.

Data Information/Variables

Clientnum - unique number for customer holding account
Attrition_Flag - "Attrited Customer" if the account is closed, else "Existing Customer"
Customer_Age - age of customer in years
Gender - gender of account holder
Dependent_Count - Number of dependents
Education_Level - has values Graduate, High School, Unknown
Marital_Status - marital status of account holder
Income_Category - annual income category of account holder
Card_Category - type of credit card the account holder has
Months_on_book - period of relationship with the bank
Total_Relationship_Count - no. of products held by customer
Month_Inactive_12_mon - no. of months inactive in the last year
Contacts_Count_12_mon - no. of contacts between customer and bank in 12 months
Credit_Limit - credit limit on the credit card
Total_Revolving_Bal - balance that carries from one month to next
Avg_Open_To_Buy - refers to amount left on credit card to use (average of last 12 months)
Total_Trans_Amt - total transaction amount in last 12 months
Total_Trans_Ct - total transactions (count) in last 12 months
Total_Ct_Chng_Q4_Q1 - ratio of total transaction count in 4th quarter and the total transaction count in 1st quarter
Total_Amt_Chng_Q4_Q1 - ratio of total transaction amount in 4th quarter and total transaction amount in 1st quarter
Avg_Utilization_Ratio - represents how much of the available credit the customer spent

The analysis below has the following sections:

  1. Loading and importing packages
  2. Removing warnings from python notebooks
  3. Loading the dataset
  4. Preview of the dataset
  5. Descriptive statistics for the dataset
  6. Exploratory Data Analysis - univariate, bivariate and multivariate analysis
  7. Data preprocessing/model preparation - feature engineering, preparation of data for modeling, missing value treatment, outlier treatment
  8. Logistic Regression models - Logistic regression, logistic regression with oversampling, logistic regression with undersampling
    Decision Tree models - Decision Tree classifier, decision tree with oversampling, decision tree with undersampling
    Bagging models - bagging classifier, bagging with oversampling, bagging with undersampling
    Random forest models - random forest classifier, random forest with oversampling, random forest with undersampling
    Adaboost models - adaboost classifier, adaboost with oversampling, adaboost with undersampling
    Gradient boosting models - gradient boosting classifier, gradient boosting with oversampling, gradient boosting with undersampling
  9. Hypertuning the best 3 models from above analysis
  10. Hypertuning with randomized search for best 3 models from above analysis
  11. Building pipeline with the best model
  12. Insights and recommendations for the marketing department

1. Loading and importing packages

2. Removing warnings from python notebook

3. Loading the dataset

4. Previewing the dataset

Observation

The data has 10,127 rows and 20 columns. At first glance, the majority of the columns are continuous variables. Some columns contain "NaN" values.

The random rows show "NaN" values in the Marital_Status column, as seen above. Income_Category also has "abc" values, which do not fit the other values in that column. The target variable for this analysis is "Attrition_Flag", which has the values "Existing Customer" and "Attrited Customer". The "CLIENTNUM" column holds the customers' account numbers and carries no useful information for building the models, so we will drop it in the data preprocessing section.

Observation

From above, we observe that only 2 columns have missing values: "Education_Level" and "Marital_Status". The target variable "Attrition_Flag" has no missing values. The columns are of type float, int or object. The 2 columns with missing values will be treated in the data preprocessing section.

Observation

The column "Education_Level" has the largest number of missing values (1519 rows), and Marital_Status has 749. We may be able to use the Customer_Age and Income_Category groups to fill in the missing values for Education_Level, using the most frequent value (mode) within each group, since the column is categorical. For Marital_Status, it may likewise be possible to group by age and impute with the mode of each group.
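The group-wise mode imputation proposed above can be sketched as follows. This is a minimal illustration on a made-up frame, assuming pandas; the age bins, the `Age_Group` column and the helper name `fill_with_group_mode` are hypothetical and are not taken from the notebook.

```python
import pandas as pd

def fill_with_group_mode(df, target, group_col):
    """Fill missing values in `target` with the most frequent value of its group."""
    def _fill(series):
        modes = series.mode()  # mode() ignores NaN; it is empty if the group is all-NaN
        return series.fillna(modes.iloc[0]) if not modes.empty else series
    out = df.copy()
    out[target] = out.groupby(group_col, observed=True)[target].transform(_fill)
    return out

# Toy data (hypothetical values, not the bank's records)
toy = pd.DataFrame({
    "Customer_Age": [28, 29, 31, 45, 47, 48, 49],
    "Marital_Status": ["Single", "Single", None, "Married", "Married", None, "Single"],
})
# Two assumed age bins, chosen only for illustration
toy["Age_Group"] = pd.cut(toy["Customer_Age"], bins=[0, 40, 100], labels=["<=40", ">40"])
filled = fill_with_group_mode(toy, "Marital_Status", "Age_Group")
print(filled)
```

To avoid data leakage, the group modes would normally be computed on the training split only and then applied to the validation and testing splits.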

5. Descriptive statistics for the dataset

Observations

  1. The mean customer age is around 46 years, and 75% of the customers are younger than 52. The mean and median of Customer_Age are the same, so its distribution is mostly symmetric.
  2. Customers have 2 dependents on average. Since the mean Dependent_Count is almost the same as the median, its distribution is also mostly symmetric.
  3. Months_on_book indicates the period of relationship with the bank. The mean is 35.9 months and the median is 36 months. 75% of the customers have had a relationship with the bank of less than 40 months, which is under 3.5 years.
  4. Total_Relationship_Count, the number of bank products the customer holds, averages 3.8 products. The median is 4 products, so this distribution is mostly symmetric as well. 75% of the customers have 5 or fewer bank products.
  5. Customers were inactive for an average of 2.34 months (median 2 months) last year. 75% of the customers were inactive for 3 months or less.
  6. Customers had contact with the bank 2.4 times on average in the past 12 months, with a median of 2 times.
  7. The mean Credit_Limit for the 10,127 customers was 8631 dollars. The median, however, was 4549 dollars, indicating a skewed distribution. 75% of the customers had a Credit_Limit of 11,057 dollars or less. The column is interpreted as dollars rather than thousands of dollars since it is a credit limit.
  8. The mean Total_Revolving_Bal was 1162.81 dollars and the median was 1276 dollars. The minimum balance was 0 dollars and the maximum was 2517 dollars.
  9. The mean amount left on the credit card over the last 12 months (Avg_Open_To_Buy) was 7469 dollars and the median was 3474 dollars. The minimum was 3 dollars and the maximum was 34,516 dollars. The gap between the median and the mean indicates a skewed distribution.
  10. Total_Amt_Chng_Q4_Q1 is the ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter. The mean was 0.75 and the median was 0.73.
  11. Total_Ct_Chng_Q4_Q1 is the ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter. The mean was 0.71 and the median was 0.702.
  12. Total_Trans_Amt, the total transaction amount in the last 12 months, had a mean of 4404.08 and a median of 3899. The minimum was 510 and the maximum was 18,484, which is a large range; however, 75% of the customers had a total transaction amount of 4741 or less.
  13. Total_Trans_Ct had a mean of 64 and a median of 67. The minimum number of transactions in the last 12 months was 10 and the maximum was 139.
  14. For Avg_Utilization_Ratio, which is how much of the available credit the customer spent, the mean was 0.27 and the median was 0.176.
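Several of the observations above infer symmetry or skew by comparing the mean with the median. A minimal sketch of that check, using only plain Python and summary values quoted in this section; the helper name `skew_hint` and the 5% relative cutoff are assumptions made for illustration, not part of the analysis.

```python
def skew_hint(mean, median, rel_tol=0.05):
    """Rough skew direction from the mean/median gap.

    rel_tol is an assumed cutoff; the median must be non-zero.
    """
    gap = (mean - median) / abs(median)
    if gap > rel_tol:
        return "right-skewed"
    if gap < -rel_tol:
        return "left-skewed"
    return "roughly symmetric"

# Summary values quoted in the observations above
credit_limit_hint = skew_hint(8631, 4549)       # Credit_Limit
months_on_book_hint = skew_hint(35.9, 36)       # Months_on_book
revolving_bal_hint = skew_hint(1162.81, 1276)   # Total_Revolving_Bal
print(credit_limit_hint, months_on_book_hint, revolving_bal_hint)
```

This is only a quick heuristic; the histograms and box plots in the EDA section give the fuller picture of each distribution.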

Observations

  • The target/dependent variable "Attrition_Flag" has only 2 unique values, with the most frequent value being "Existing Customer". Since the total count is 10,127, this indicates that 83.9% of the customers have the "Existing Customer" flag. This implies an imbalanced dataset.
  • Gender has only 2 unique values; 5358 customers (52.9%) are "F", which stands for female.
  • Education_Level has 1519 missing values, as described earlier. There are 6 unique values. The most frequent value is Graduate, at 30 percent of the customer base.
  • Marital_status has 3 unique values - this column will be further explored in depth to see if any of the groups can be clubbed together.
  • The Income_Category has 6 unique values as well, with 35% of the customers having an income less than 40K per annum.
  • Card_Category is the type of credit card product that the customer has. There are 4 unique values, and 93.1% of the customers have credit card of the type "Blue".
    Observations

    The percentage of customers who remained at the bank is 83.9% indicating that approximately 16% of customers left the bank.

    Observations

    For 'Attrition_Flag' we can see that the majority of the customers (~84%) are existing customers. However, the bank wants to raise the share of customers it retains.

    52 percent of the customers are female, while 48 percent are male. This is more or less an even distribution.

    Observations

    There are 3 categories for 'Marital_Status' - married customers formed the largest group (46.2%) while single customers were the next biggest group.

    Observations

    The bank offered 4 types of cards - 'Blue' had the largest group of customers (93.1%) followed by 'Silver', 'Gold'. Very few customers (20 customers) had 'Platinum'.

    Observations

    30% of the customers are graduates. The next largest group is high school at 19.8%. The doctorate group is the smallest group in the customer base.

    For income category, the largest group of customers earns less than 40K per annum, and the smallest group earns 120K+ per annum. There is also a category called "abc", which looks like a data collection error. If it were possible to talk to the data collection team, the exact meaning of "abc" could be ascertained. Since we have no such contact, we will treat the "abc" category in the data preprocessing section.

    6. Exploratory Data Analysis - univariate, bivariate and multivariate analysis

    Observations

  • Credit_Limit ranges from 0 to 35,000 dollars. However, apart from about 500 customers who have a Credit_Limit of 35,000 dollars, most customers are clustered below 5000 dollars. Hence, the distribution is right-skewed (the tail is on the right). The customers grouped around 35,000 dollars are unlikely to be outliers: there are about 500 of them, and the "Platinum" card category is likely to have a higher Credit_Limit than the "Blue" category.
  • In the box plot of Credit_Limit by Attrition_Flag, the medians for "Existing Customer" and "Attrited Customer" are similar.
  • However, in the box plot of Credit_Limit versus Gender, the median Credit_Limit for females is considerably lower than for males. The female category has many points beyond the upper whisker, while the male category has none. From the descriptive statistics section, we saw that 52% of all customers are female.
  • In the box plot of Credit_Limit versus Education_Level above, the median Credit_Limit is similar across all education groups. There are data points beyond the upper whiskers. However, these are unlikely to be outliers, as some card categories may carry higher credit limits.
  • Looking at Credit_Limit versus Card_Category, the ranges differ considerably between categories. 50% of the Blue category customers have a Credit_Limit between approximately 3000 and 9000 dollars. However, 50% of the Gold category customers are between 18,000 and 35,000 dollars, and 50% of the Silver category are between approximately 15,000 and 35,000 dollars. The Platinum category customers are grouped between 32,000 and 35,000 dollars.
  • For Credit_Limit versus Marital_Status, the medians for each category are similar, and the credit limit range is similar too.
    Observations

  • Customer_Age is a symmetric distribution with a mean and median around 46 years.
  • The ranges and medians of Customer_Age are quite similar across gender, education level, card category and marital status. They are also similar between existing customers and attrited customers.
    Observations

  • The Months_on_book variable - which indicates the period of relationship the customer has with the bank is mostly symmetric distribution, however, there is a large peak around 35 months (approximately).
  • The box plots of Months_on_book with regards to Attrition_Flag, Gender, Education_Level, Card_Category, Marital_Status are similar - the ranges and medians are similar between the categories in each plot.
    Observations

  • The Total_Revolving_Bal is the balance that carries over from one month to the next. There are peaks around 0 dollars and 2500 dollars with the peak at 0 dollars being larger.
  • When it comes to Total_Revolving_Bal with respect to Attrition_Flag - we find that there are differences between the range and medians between existing customers and attrited customers. 50 percent of existing customers have a Total_Revolving_Bal of 800 to 1800 dollars. However, 50 percent of attrited customers have a Total_Revolving_Bal of 0 dollars to approximately 1200 dollars (which is less than the median of the Total_Revolving_Bal of the existing customers).
  • While the median Total_Revolving_Bal is similar for males and females, the range for females is 0 to 1800 while for males it is 600 to 1800.
  • With respect to education level, the medians of Total_Revolving_Bal are similar across the groups. The College and Doctorate groups have a lower limit of 0 dollars, unlike the other groups.
  • Overall, the medians for Total_Revolving_Bal are similar among the card categories.
  • In the Marital_Status box plot, while the medians are similar among the married, single and divorced groups - the single customer group had a lower limit of 0 dollars for Total_Revolving_Bal.
    Observations

  • Avg_Open_To_Buy, the amount left on the credit card to use, has a right-skewed distribution (the tail is on the right). The range is 0 to 35,000 dollars.
  • The range and medians for existing customers versus attrited customers for Avg_Open_To_Buy are similar.
  • With respect to gender, there are considerable data points beyond the upper whisker for the female category. The medians of Avg_Open_To_Buy differ between the male and female categories, with the female category having the lower median.
  • The medians and ranges are similar for Avg_Open_To_Buy with respect to education level.
  • For card category, the median Avg_Open_To_Buy for the Blue category is considerably lower than the medians for the Gold, Silver and Platinum categories. The Blue category also has more data points beyond the upper whisker. The Platinum category has a couple of outliers as well, but below the lower whisker.
  • The medians, ranges and pattern of outliers are similar across groups in the last box plot, which shows Marital_Status and Avg_Open_To_Buy.
    Observations

    The distribution of Total_Trans_Amt (the total transaction amount in the last 12 months) has multiple peaks, and the range is 0 to 17,500. With respect to Attrition_Flag, the median Total_Trans_Amt for attrited customers is lower than the median for existing customers. There was some difference between the medians for the two genders, but it was not large. For education level, the medians and ranges of Total_Trans_Amt are similar, and the same holds for Marital_Status. For card category, however, the median Total_Trans_Amt for the Platinum category differs from that of the Blue category. This may be attributable to the Platinum category having a higher credit limit.

    Observations

    Avg_Utilization_Ratio, which represents how much of the available credit the customer spent, has a right-skewed distribution (the tail is on the right). Attrited customers have a lower range and median of Avg_Utilization_Ratio than existing customers. The median Avg_Utilization_Ratio is similar across gender, education level, card category and marital status groups. With respect to range, the Blue category has a wider range of Avg_Utilization_Ratio than Gold, Silver or Platinum. This suggests that Blue customers utilize their credit to a larger extent than Gold, Silver or Platinum customers. One factor driving this may be that the credit limit for Blue customers is much lower than for the Gold, Silver and Platinum categories.

    Observations

    From the above catplot, we can observe that the largest group of existing customers have Blue card type product. But also the Blue category had the largest number of attrited customers.

    Observations

    The females were higher in Blue card category for both existing and attrited customers.

    Observations

    The customers who formed the largest group for existing customers as well as attrited customers were Graduates.

    Observations

    The largest number of customers for existing customer and attrited customer were customers who had a dependent count of 2 and 3.

    Observations

    With respect to income category, the largest group among existing customers earned less than 40,000 dollars per annum. The less than 40K per annum group was also the largest among the attrited customers.

    Observation

    Among married, single and divorced customers alike, females were mostly in the less than 40 thousand dollars per annum category, while the 80K-120K category was more common among males.

    Observations

    Married graduates formed the largest customer group for existing customers while married graduates as well as single graduates formed the largest group for attrited customers as seen above.

    Observations

    The attrited customers for platinum and silver card categories were slightly higher in age as compared to existing customers. While in the gold card category, the attrited customers were slightly lower in age as compared to existing customers. For the blue category, the ages of the customers were similar.

    Observations

    The pair plot gives us a bird's eye view: we can observe that Customer_Age is symmetric, some other continuous variables like Credit_Limit are skewed, and some distributions have multiple peaks. We can also see some correlation between variables, and we will use a correlation heatmap to evaluate these correlations.

    Observations

    The first observation is that Avg_Open_To_Buy is almost perfectly correlated with Credit_Limit (1.00). Hence, we will drop the Avg_Open_To_Buy column in the data preprocessing section. Two more pairs are highly correlated: Total_Trans_Amt with Total_Trans_Ct (0.81), and Customer_Age with Months_on_book (0.79). Because of these high correlations, we will drop the Total_Trans_Ct and Months_on_book columns in the data preprocessing section. Total_Revolving_Bal and Avg_Utilization_Ratio have a moderate correlation of 0.62. Total_Revolving_Bal represents the balance that carries over from one month to the next, while Avg_Utilization_Ratio represents how much of the available credit the customer spent. Based on their definitions and the moderate correlation, we will drop Avg_Utilization_Ratio in the data preprocessing section.

    Avg_Open_To_Buy has a moderate negative correlation with Avg_Utilization_Ratio (-0.54), as does Credit_Limit (-0.48). Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1 have a weak positive correlation (0.38). There are weak negative correlations between Total_Trans_Amt and Total_Relationship_Count (-0.35) and between Total_Trans_Ct and Total_Relationship_Count (-0.24).
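Reading highly correlated pairs off a heatmap can also be done programmatically. The sketch below assumes pandas and NumPy and runs on synthetic data; the helper name `high_corr_pairs` and the 0.75 cutoff are assumptions chosen for illustration, not the rule used in the notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.75):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skipping the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Synthetic data: the open-to-buy amount is built from the credit limit,
# so the two are almost perfectly correlated by construction.
rng = np.random.default_rng(0)
limit = rng.uniform(1500, 35000, size=500)
toy = pd.DataFrame({
    "Credit_Limit": limit,
    "Avg_Open_To_Buy": limit - rng.uniform(0, 2500, size=500),
    "Customer_Age": rng.integers(26, 73, size=500),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

One column from each reported pair can then be considered for removal, with the choice guided by which column is easier to interpret.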

    7. Data pre-processing section/model preparation section

    This section will have the following subsections:

    1. Treating "abc" values in Income_Category variable
    2. Evaluation of outliers
    3. Dropping columns that are correlated from dataset
    4. Missing value treatment without data leakage/splitting the dataset/encoding of target variable
    5. Encoding of categorical variables

    7.1 Treating "abc" values in Income_Category variable

    Observations

    There are 3 columns in the data set that have missing values - Education_Level (1519 values), Marital_Status(749 values), and Income_Category (1112 values).

    Observations

    The "abc" has been successfully replaced by NaN values
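A minimal sketch of this replacement step, assuming pandas and NumPy. The income labels in the toy frame are hypothetical placeholders rather than the exact category strings in the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with hypothetical category labels and the stray "abc" value
toy = pd.DataFrame({"Income_Category": ["Less than 40K", "abc", "120K+", "abc"]})

# Treat "abc" as missing so it can be imputed together with the other NaN values
toy["Income_Category"] = toy["Income_Category"].replace("abc", np.nan)

n_missing = int(toy["Income_Category"].isna().sum())
n_abc_left = int((toy["Income_Category"] == "abc").sum())
print(n_missing, n_abc_left)
```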

    7.2 Evaluation of Outliers

    During the EDA process, we observed that several variables showed outliers, for example the data points beyond the upper whisker in the box plot of Credit_Limit by Attrition_Flag. However, customers hold different card categories (Blue, Gold, Silver, Platinum), and these categories have different credit limits (Gold, Silver and Platinum all have an upper limit of 35,000 dollars). Thus, data points beyond the upper whisker with Credit_Limit values between 25,000 and 35,000 are unlikely to be data collection errors or extreme values. Hence, for this data set, we will not treat outliers, as we do not consider them to be errors.
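For reference, the points a box plot draws beyond the upper whisker follow the usual Q3 + 1.5 x IQR rule. The sketch below, assuming NumPy and a handful of hypothetical credit limits, shows how the 35,000 dollar cards get flagged even though they are legitimate values; the helper name `upper_whisker_limit` is made up for this example.

```python
import numpy as np

def upper_whisker_limit(values, k=1.5):
    """Upper box-plot fence: Q3 + k * IQR (k=1.5 is the usual box-plot convention)."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

# Hypothetical credit limits: many low-limit cards plus two premium cards at 35,000
limits = np.array([2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 35000, 35000])
fence = upper_whisker_limit(limits)
beyond = limits[limits > fence]
print(fence, beyond)
```

Falling beyond the fence only marks a point for review; whether it is an error depends on domain knowledge such as the card category limits discussed above.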

    7.3 Dropping correlated columns

    Observations

    The original dataset had 20 columns, and we are dropping 4 correlated ones - so the resulting data set is 10127 x 16.

    7.4 Missing value treatment without data leakage, encoding the target variable
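The idea behind leakage-free imputation is to split first and then learn the fill values from the training split only. The sketch below assumes scikit-learn (`train_test_split`, `SimpleImputer`) and a toy frame with hypothetical values. The 60/20/20 ratio is inferred from the row counts reported for the final split, and the stratified split and mode strategy are assumptions rather than confirmed details of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with hypothetical values; target is 1 = attrited, 0 = existing
rng = np.random.default_rng(1)
n = 200
edu = rng.choice(["Graduate", "High School"], size=n).astype(object)
edu[rng.random(n) < 0.15] = np.nan  # knock out roughly 15% of the values
X = pd.DataFrame({"Education_Level": edu})
y = pd.Series(rng.choice([0, 1], size=n, p=[0.84, 0.16]), name="Attrition_Flag")

# Split BEFORE imputing: 60% train, then the remainder halved into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=1)

# Learn the most frequent value on the training split only, then reuse it everywhere
imputer = SimpleImputer(strategy="most_frequent")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_val_imp = pd.DataFrame(imputer.transform(X_val), columns=X.columns, index=X_val.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)
print(imputer.statistics_)
```

Because the imputer never sees the validation or testing rows during `fit`, their values cannot influence the fill value used in training.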

    7.5 Encoding of the categorical variables (excluding target variable) after missing value imputation

    Observations

    This concludes the data preprocessing and model preparation section. The final split is 6075 rows and 25 columns for the training set, and 2026 rows and 25 columns each for the validation and testing sets.

    8. Model building and performance evaluation

    Note about model predictions

    The dependent variable in this analysis is the Attrition_Flag.

    If Attrition_Flag = Attrited Customer, we have assigned Attrition_Flag = 1. If Attrition_Flag = Existing Customer, we have assigned Attrition_Flag = 0.

    To evaluate which performance metric is most valuable for the analysis, we have to consider

    (a) Predicting that a customer will be an attrited customer, but the customer stayed - loss of opportunity (false positive)

    (b) Predicting that a customer will stay, but the customer attrited - loss of revenue (false negative)

    For the bank, loss of revenue has a higher cost than loss of opportunity. Hence, we want to minimize false negatives, and thus, we shall use recall (sensitivity) as our most important performance metric for the models.
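To make the metric concrete, recall is TP / (TP + FN), so every missed attrited customer (false negative) lowers it directly, while false positives affect precision instead. A minimal sketch in plain Python; the counts are hypothetical and are not results from this analysis.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): share of actual attrited customers the model catches."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Precision = TP / (TP + FP): share of predicted attritions that were real."""
    return tp / (tp + fp)

# Hypothetical validation counts (NOT results from this analysis)
tp, fp, fn, tn = 280, 120, 45, 1581

model_recall = recall(tp, fn)
model_precision = precision(tp, fp)
print(f"recall={model_recall:.3f}, precision={model_precision:.3f}")
```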

    Note about model building

    There will be 6 models built in this section, in 4 steps:

    Step 1: Models built - (a) Logistic Regression (b) Decision Tree (c) Random Forest (d) Bagging (e) Adaboost (f) Gradient boosting. All the models will use kfold, and recall will be the performance metric for evaluating performance on the training set and validation set.

    Step 2: Models with oversampling (SMOTE) - (a) Logistic Regression (b) Decision Tree (c) Random Forest (d) Bagging (e) Adaboost (f) Gradient boosting. All the models will use kfold, and recall will be the performance metric for evaluating performance on the training set and validation set.

    Step 3: Models with undersampling - (a) Logistic Regression (b) Decision Tree (c) Random Forest (d) Bagging (e) Adaboost (f) Gradient boosting. All the models will use kfold, and recall will be the performance metric for evaluating performance on the training set and validation set.

    Step 4: Evaluate performance and choose the best 3 performing models for the next section.
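The cross-validated comparison described in these steps can be sketched with scikit-learn as below. This is an illustration on synthetic imbalanced data, not the notebook's code: the 5 stratified folds, the subset of four models and the default hyperparameters are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 16% positives) standing in for the training set
X, y = make_classification(n_samples=600, n_features=8, weights=[0.84, 0.16], random_state=1)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Adaboost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_recall = {
    name: cross_val_score(model, X, y, scoring="recall", cv=cv).mean()
    for name, model in models.items()
}
for name, score in cv_recall.items():
    print(f"{name}: mean cross-validated recall = {score:.3f}")
```

Comparing each model's mean cross-validated recall with its recall on the full training set is what reveals the overfitting gaps discussed in the observations.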

    Observations

    The above data set is imbalanced (about 84% are existing customers and 16% are attrited customers).

  • Logistic regression - The recall scores are extremely low for the logistic regression model (influenced by the imbalanced data set): 34.2% for training and 33.3% for the validation set. This indicates underfitting, where the model is not capturing enough information.
  • Decision Tree - On training performance, the decision tree has a recall of 100%. This indicates overfitting (since the decision tree is not pruned here), and the model will most likely perform poorly on the testing set. Since we use kfold, we can look at the cross validation scores: the decision tree has a recall of 72.5%, compared to 100% on the training set. This points to an overfit model that does not generalize.
  • Random forest - Random forest is similar to decision tree in terms of training performance (100%). Its cross validated recall is 65.4%. This indicates an overfit model that does not generalize.
  • Bagging - The bagging model had a training score of 97.5% and 72.3% on the cross validated set. This large gap indicates that the model is overfit and does not generalize.
  • Adaboost - Adaboost had a training score of 75% and a cross validated score of 73%. This indicates that the model is generalizing well; however, we do want to improve the recall scores.
  • Gradient boosting - The gradient boosting model had a recall of 81.7% for training and 75% on the cross validated set. This gap suggests the model is not generalizing well, and the recall scores should be improved.
    Observations

    Gradient boosting gives the highest cross validated recall so far (75%), followed by Adaboost (73%) and decision tree (72.5%), while Adaboost shows the smallest gap between training and cross validated scores. Logistic regression performed the poorest of all the 6 models. The next step will be to address the imbalance in the data and see if that improves the recall scores for the 6 models.

    Observations

    Oversampling (SMOTE) was used to address the imbalance in the dataset.

  • Logistic regression - The recall scores improved for logistic regression models. It is 77.3% for training, and 78.8% approximately for the set used for cross validation. This indicates that the model is fairly generalized.
  • Decision Tree - if we look at training performance, decision tree has 100% for recall scores. This indicates overfitting and most likely the model will perform less than 100% on the testing set. Since we use kfold, we can look at the cross validation scores - and decision tree has a recall of 93.9% as compared to 100% on the training set.
  • Random forest - random forest is similar to decision tree in terms of training performance (100%). On the cross validation section, it is having a recall of 96.23%. This indicates a model that is more generalized as compared to the random forest model in step 1.
  • Bagging - The bagging model had a training score of 99.7% and 95.33% on the cross validated set. The model is performing well, but there is still a considerable difference between the training and cross validated scores.
  • Adaboost - Adaboost had a training score of 94.35% and 94.33% for cross validated score. This indicates that the model is generalizing well, as well as giving reasonable recall scores.
  • Gradient boosting - gradient boosting model had a performance of 96.52% for training and 95.62% on the cross validated set. This points to the model generalizing well, and good recall scores as well.
    Observations

    With oversampling, the plot shows that random forest performed the best, with one outlier. The next best performing models were bagging and gradient boosting.
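The oversampled results above rely on SMOTE, which creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The NumPy sketch below is a simplified illustration of that interpolation idea only; it is not the imbalanced-learn implementation the analysis presumably used, and the function name `smote_like_oversample` and its defaults are made up for this example.

```python
import numpy as np

def smote_like_oversample(X, y, minority=1, k=5, seed=0):
    """Add synthetic minority rows until both classes of a binary target are balanced.

    Each synthetic row lies on the segment between a minority point and one of
    its k nearest minority neighbours. Assumes at least 2 minority rows.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # Pairwise distances among minority points; a point is never its own neighbour
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)            # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])

# Toy data with the same 84/16 imbalance as the dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.array([0] * 84 + [1] * 16)
X_res, y_res = smote_like_oversample(X, y)
print(X_res.shape, np.bincount(y_res))
```

Oversampling is normally applied to the training data only, so that the validation and testing sets keep the original class ratio.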

    Observations

    The results above are with undersampling techniques.

  • Logistic regression - The recall scores dropped as compared to oversampling. It is 73.46% for training, and 73.97% approximately for the set used for cross validation. This indicates that the model is fairly generalized.
  • Decision Tree - if we look at training performance, decision tree has 100% for recall scores. This indicates overfitting and most likely the model will perform less than 100% on the testing set. Since we use kfold, we can look at the cross validation scores - and decision tree has a recall of 86.06% as compared to 100% on the training set.
  • Random forest - Random forest is similar to decision tree in terms of training performance (100%). Its cross validated recall is 89.65%. This indicates a model that generalizes less well than the random forest model in step 2.
  • Bagging - The bagging model had a training score of 99.18% and 90.06% on the cross validated set. The model is performing well, but there is still a considerable difference between the training and cross validated scores.
  • Adaboost - Adaboost had a training score of 91.29% and 89.34% for cross validated score. This indicates that the model is generalizing well, as well as giving reasonable recall scores (though less than oversampling models).
  • Gradient boosting - The gradient boosting model had a recall of 97.13% for training and 92.82% on the cross validated set. The recall is good, but the considerable difference between the training and cross validated scores suggests some overfitting.
    Observations

    We can see the presence of outliers with the undersampling techniques. The best performing models were gradient boosting, adaboost and bagging.
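The text does not say which undersampling method was used. As one common possibility, the sketch below shows plain random undersampling in NumPy: every row of the smaller class is kept, along with an equal-sized random sample of the larger class. The function name `random_undersample` is hypothetical.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep all rows of the smallest class and an equal-sized random sample of the others."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data with the same 84/16 imbalance as the dataset
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 84 + [1] * 16)
X_res, y_res = random_undersample(X, y)
print(X_res.shape, np.bincount(y_res))
```

Undersampling discards most of the majority class, which may explain why the recall scores here are somewhat lower than with oversampling.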

    9. Hypertuning the best 3 models

    Based on the cross validation and training scores and on whether the model generalizes well, the best 3 models from above are (a) Gradient boosting with oversampling [96.52% for training and 95.62% on the cross validated set], (b) Adaboost with oversampling [training score of 94.35% and 94.33% for cross validated score], and (c) Bagging with oversampling [training score of 99.7% and 95.33% on the cross validated set]. Random forest with oversampling was a close contender, but its 100% fit on training performance indicates an overfit model. These 3 models will be tuned with GridSearchCV to find the best parameters.

    Gradient boosting with oversampling - hypertuning with GridSearch CV

    Observations

    Hyperparameter tuning was performed for gradient boosting with oversampled data (to correct for imbalance). n_estimators is the number of sequential weak learner trees to fit; the scikit-learn default is 100. subsample is the fraction of samples used for fitting the individual base learners. According to the scikit-learn documentation, choosing a subsample of less than 1.0 leads to a reduction of variance and an increase in bias. max_features is the number of features considered when looking for the best split (a fraction of the features when given as a float). GridSearchCV took 2 min 51 seconds in this case and found the best parameters to be 250 weak learner trees, a subsample of 0.8 and max_features of 0.9.
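A grid search over these three parameters can be sketched with scikit-learn as follows. The grid values, the 3 folds and the synthetic data are assumptions kept deliberately small so the example runs quickly; the notebook's actual grid is not shown in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the oversampled training set
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Illustrative grid over the three parameters discussed above
param_grid = {
    "n_estimators": [50, 100],
    "subsample": [0.8, 1.0],
    "max_features": [0.7, 0.9],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid,
    scoring="recall",  # recall is the metric chosen for this analysis
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The tuned model is available afterwards as `search.best_estimator_`, already refit on the full data passed to `fit`.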

    Observations

    The tuned gradient boosting model with oversampled data had a recall of 98.11%, a precision of 98.15%, an F1 score of 98.13%, and an accuracy of 98.13%.
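As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values quoted above. The helper name below is a hypothetical one chosen for illustration:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted in the text, expressed as fractions.
f1 = f1_from(precision=0.9815, recall=0.9811)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9813"
```

The result agrees with the reported F1 score of 98.13%.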

    Observations

    True Positive: The model predicted that the customer will attrite, and the customer attrited. This is where predicted = 1 and observed = 1. This is 2.67% for the tuned gradient boosting model on the validation set, which is very low.

    False Positive: The model predicted that the customer will attrite, and the customer did not attrite. This is where predicted = 1 and observed = 0. This is 14.26% on the validation set.

    True Negative: The model predicted that the customer will not attrite, and the customer did not attrite. This is where predicted = 0 and observed = 0. This is 69.64% on the validation set.

    False Negative: The model predicted that the customer will not attrite, and the customer attrited. This is where predicted = 0 and observed = 1. This is 13.43% on the validation set.
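The four cells can be turned into summary rates with a few lines of arithmetic. The sketch below takes the quoted validation-set percentages at face value and assumes they are shares of all validation rows; the function name is hypothetical. If the cells were read off the confusion-matrix plot in a different orientation, the derived numbers would change accordingly.

```python
def rates_from_cells(tp: float, fp: float, tn: float, fn: float) -> dict:
    """Derive accuracy, recall and precision from confusion-matrix cells.

    Cells may be counts or percentages, as long as all four use the same unit.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # share of actual attriters that were caught
        "precision": tp / (tp + fp),   # share of predicted attriters that attrited
    }


# Validation-set cell percentages quoted in the text.
cells = {"tp": 2.67, "fp": 14.26, "tn": 69.64, "fn": 13.43}
cell_total = sum(cells.values())
rates = rates_from_cells(**cells)

print(f"cells sum to {cell_total:.2f}%")
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Because recall depends only on the true positive and false negative cells, it is the quantity most sensitive to how those two cells are read.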

    Gradient boosting with oversampling - hypertuning with Randomized Search CV

    Adaboost with oversampled data with hyperparameter tuning

    Observation

    The best model was gradient boosting, with a recall of 89%.

    Insights

    1. The mean customer age is 46, and 75% of the customers are 52 years old or younger. We can infer that the customer base for the bank's products is relatively young and consists mostly of customers who have not yet retired from professional life.
    2. 75% of the customers have had a relationship with the bank for less than 2 years - this is consistent with the high attrition rate.
    3. 75% of the customers had a total transaction amount of about 4,741 dollars or less in the last 12 months. The maximum was 18,484 dollars. This indicates that the majority of the customers are not big spenders when using a credit card.
    4. The median credit limit for females is considerably less than the median credit limit for males, while 52% of the total customers are female.
    5. There are a large number of customers (around 2500) who carry over 0 dollars to the next month. This group includes attrited customers. Overall, 50% of the attrited customers carry over between 0 and 1200 dollars, which is less than the median carry-over amount for the existing customers. This may indicate that the card services, fees, or penalties were a factor in their attrition.
    6. The attrited customers have a lower range and median of average utilization ratio than existing customers. This suggests that customers who do not use their available credit are more likely to attrite.
    7. The male customer base had more customers in the 80-120K annual income range, while for females the largest group of customers earned less than 40K per annum.
    8. Overall, the profile of the attrited customer is likely to be (a) female (b) graduate (c) holding a blue card category type product (d) making less than 40,000 dollars per annum.

    Recommendations for strategies for the bank

    1. Since 75% of the customers are not big spenders with their credit cards, the bank should focus on card services that are appropriate for smaller amounts of spending - such as reasonable annual fees and penalties. This strategy is also worth pursuing because 35% of the customers have an annual income of less than 40,000 dollars.
    2. Only 20 of the 10,127 customers had the card type "Platinum", indicating that this card type was not popular with the existing customer base. The customers with this card category have a credit limit of more than 30,000 dollars - hence it is crucial that these cards are marketed to the correct income group (i.e. the higher income group).
    3. The upper credit limit for Gold, Silver, and Platinum is around 35,000 dollars. However, since 35% of the customers have an income less than 40,000 dollars per annum - one of the strategies the bank can adopt is to seek out higher income customers and offer a referral service to their existing platinum customers so as to bring in new platinum customers.
    4. Another strategy for the bank to pursue is to review its fees, penalties, and services for customers who tend to have a total revolving balance of 0 to 1800 dollars, so that the attrition rate from this group can be lowered. These customers are likely to be female customers or single customers, and the bank can advertise alert systems so that these customers become aware when their total revolving balance gets closer to 0 dollars. This will prevent customers from being surprised by fees or penalties, which may drive them to attrite and shop for another bank's product. A strategy such as this can help the bank lower its attrition rate and continue to generate revenue from its services.
    5. Alternatively, the bank can target married or divorced customers with a higher annual income, as these customers are less likely to reach a total revolving balance of 0 dollars and are thus less likely to attrite.